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Abstract 


Adaptive quadrature is applied to marginal maximum likelihood estimation for item response 
models with normal ability distributions. Even in one dimension, significant gains in speed and 
accuracy of computation may be achieved. 
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In marginal maximum likelihood estimation for models for item responses in which the 
ability distribution is normal, evaluation of the log likelihood and its partial derivatives requires 
quadrature. Gauss-Hermite quadrature is attractive given the normal ability distribution, but this 
method of integration may be less efficient than adaptive Gauss-Hermite quadrature for relatively 
long tests. 

To study this issue, Gauss-Hermite and adaptive Gauss-Hermite quadrature are described in 

TM 

sections 1 and 2. Application to the Rasch and 2PL models is made using data from the Praxis 
series. 

Implications of results for psychometric practice are considered in section 3. On the whole, 
adaptive quadrature appears attractive even for one-dimensional cases. 


1 Gauss-Hermite Quadrature 

Gauss-Hermite quadrature is a classical numerical integration technique based on Hermite 
polynomials (Ralston, 1965, pp. 93-97). It has been applied to marginal estimation for a long 
period of time (Bock & Lieberman, 1970). In general, the Gauss-Hermite approach is applied to 
an integral of the form 

/ OO 

f(x) exp (-x 2 )dx. 

-OO 

The r-point approximation 

r 

I r (f) = ^w k f(x k ) exp(-.Tfc) 
k =1 

is designed so that I r (f ) = 1(f) for any real function / such that, for an integer k < 2r — 1, 
f(x) = x k for all real x. Tables of the x k and w k are readily available both in general works on 
mathematical functions (Davis & Polonsky, 1965, p. 924) and in specialized works on Gaussian 
quadrature (Stroud & Secrest, 1966). An important feature of Gauss-Hermite quadrature is that 
the difference e r (f) = 1(f) — I r (f ) satisfies 


e r (f) 


rl(2n) 1/2 f 2r (y) 

2 r (2r)! 


for some real r/ if / has continuous derivatives f k for 1 < k < 2r. 

In marginal estimation, it is more convenient to consider expressions based on the normal 
density function. The standard formula in calculus for change of variables implies that, for the 
standard normal density function f with value 4>(x) = exp(—x 2 /2)/(27r) 1 / 2 for x in R and for a 
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real function g , the integral 


/ OO 

g{y)4>(y)dy 

-oo 

is equal to 1(f) if f(x) = g( 2 1 / 2 x)/7r 1 / 2 for real x. If y% = 2 1 / 2 x/ c and Vk = Wk/ 7 r 1 / 2 for 1 < k < r, 
then 

r 

Mg) = 22g(yk)v k = i r (f ), 

and J(g) — J r (g ) equals 


fc=i 


Mf) = 


r!(2) 1 / 2 ff 2 r(2 1/2 r?) 

(2r)! 


where is the 2rth derivative of g. It is important in the discussion in this paper to note that 
each Vk is positive, Vk = iv-fc+i, yk = ~yr-k+ 1 , and Ylk=i v k = 1, so that J r (g) is symmetric 
about 0 in the sense that J r (g) = J r (h) if h(x) = g(—x) for all real x. If / is a bounded and 
continuous function on the real line, then e r (f) approaches 0 as r approaches 00 (Breiman, 1968, 
p. 181). More generally, if / is continuous and |/(x)|/[l + |x| fc ] is bounded for real x for some 
real k > 0, then e r (f) approaches 0 as r approaches 00 (Breiman, 1968, p. 164). In addition, if 
$ is the distribution function of a standard normal random variable and if <P r is the distribution 
function such that <F r (x), x real, is the sum of the Vk for integers k, 1 < k < r, such that yk < x, 
then <I> r (x) approaches <h(x) as r approaches 00 (Breiman, 1968, p. 159). 

To illustrate use of Gauss-Hermite quadrature, two simple examples of one-dimensional item 
response models will be considered. Both will be applied to data from the Praxis series in which 
45 items are right-scored for 8,686 examinees. 

In both applications, n examinees and q items are present. For examinee i from 1 to n, Xjj 
is the response code for item j, 1 < j < q. It is assumed that Xij is 1 for a correct response 
and 0 for an incorrect or missing response. The vectors X ? with coordinates X^, 1 < j < q, 
are assumed to be independent and identically distributed random vectors. Let T be the set of 
q- dimensional vectors with coordinates that are 0 or 1, and let S be the set of arrays s with 
nonnegative coordinates s(x), x in T, such that the sum 5^ xer s((a:) = 1. Let p in S be the array 
of probabilities p(x) = P(Xj = x) for x in T, so that p defines the distribution of X, and the log 
likelihood function £ is 

n 

^(p) = 22 logp ^- 

i= 1 

For a nonempty subset T of S, consider a model M that p is in T. The log likelihood £(p) has 
maximum £(T) for p in T, and a maximum-likelihood estimate p in T satisfies t?(p) = £(T). If 
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g is a function on T, then g = g(p) is a maximum-likelihood estimate of g( p). The estimated 
minimum logarithmic penalty per item for prediction of X, by use of a probability vector in T is 

H = —(nq)~ 1 £(T) 


(Gilula & Haberman, 1994, 1995). If p is in T, then H provides an estimate of the entropy per 
item of the responses Xj. 

Associated with each examinee i is an ability variable #*. The 9{, 1 < i < n, are assumed to 
be independent and identically distributed random variables with common distribution function 
D, and, for each i, 1 < i < n, the Xij, 1 < j < q, satisfy the local independence requirement 
that they are conditionally independent given #*. It is further assumed that the pairs (X,, 0 t ) are 
mutually independent. For each examinee i, 1 < i < n, the conditional probability that X t] = 1 
given Oi = 9 is Pj(6) > 0, and the conditional probability that X tJ = 0 is Qj(9) > 0. The item 
logit function Xj of item j, 1 < j < q, is then log (Pj/Qj) (Holland, 1990). The item logit vector A 
is then the (/-dimensional function with coordinate j. 1 < j < q, equal to Xj. 

For (/-dimensional vectors a and b with respective coordinates aj and bj for 1 < j < q. let 

<? 

a'b = J2 ( P b r 
3 =1 


and let 


Then 


V = II Qj = fit 1 + ex P( A i)l '■ 

j= 1 i =1 


p(x) = j V exp(x ; A )dD 


(1) 

( 2 ) 


for all x in r, so that 

n 

p) = log 

i =1 

In this report, one-parameter (1PL, Rasch) and two-parameter (2PL) models are considered. 
In the one-paranreter case, let 1 be the (/-dimensional vector with each coordinate 1. Let Ai be 
the set of functions A such that 


J Hexp(X'A )dD. 


A (9) = a91 — 7 


(3) 


for some common item discrimination a and some (/-dimensional vector 7 with coordinates 7 j, 

1 < j < q, so that 7 j/a is the difficulty parameter for item j, 1 < j < q. Then A is assumed to be 
in Ai. 
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In the two-parameter case, let A 2 be the set of functions A such that 


A (9) = 6a - 7 (4) 

for some (/-dimensional vector a with coordinates aj > 0 for 1 < j < g and some (/-dimensional 
vector 7 with coordinates 7 /, 1 < j < q, so that aj is the item discrimination and 7 j/aj is the 
difficulty parameter for item j. 1 < j < q. Then A is assumed to be in A 2 . 

It is quite common to assume that the common distribution function D of the 9i is the 
standard normal distribution function $, so that 

p(x) = J(V exp(x'A)) (5) 


for all x in T, and 

n 

^(p) = ^logJ(Fexp(X'A)). (6) 

i =1 

For example, in the normal 1PL (normal Rasch) model M\, it is assumed that p is in Tf, 
where T\ consists of arrays p in S such that (1) and (2) hold for some A in Ai and D = $. 
Similarly, in the normal two-parameter model M 2 , it is assumed that p is in X 2 , where T 2 consists 
of arrays p in S such that (1) and (2) hold for some A in A 2 and D = $. 

If Gauss-Hermite quadrature with r points is used to approximate the log likelihood for model 
Mfc, where k is 1 or 2 , then the practical effect is to replace with the model M^ r that p is in 
Tfc r , where T^ r is the set of arrays p in S such that (1) and (2) hold for some A in A^ and D = $ r . 
Thus 


for all x in T, and 


p(x) = J r (V exp(x / A)) 


n 


(7) 


^(p) = ^!°g MV exp(X'A)). ( 8 ) 

2—1 

For model M ^, maximization of the log likelihood corresponds to maximization of a log likelihood 
function for a log-linear model of a frequency table in which not all cells are directly observed, so 
that algorithms developed for such models can be applied (Haberman, 1988). 

In practice, results for model M^r are quite similar to results for the model even for 
integers r of moderate size. Consider the example from the Praxis program. In the case of M 1 , 
the normal Rasch model, r = 20 points suffice for quite accurate results. Differences between 
corresponding maximum-likelihood estimates of the parameters a and 7 j under models M\ r and 
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Mi do not exceed 0.00003 in magnitude, a very satisfactory result, especially given that the 
estimated asymptotic standard deviations for these parameters are at least 0.02. Results for r = 9 
are a bit less accurate but still relatively satisfactory. In this case, the magnitude of differences 
does not exceed 0.007 in the case of a and 0.001 for the jj. To compare estimated minimum 
logarithmic penalties per item, let H} ; be —(nq)~ 1 £(Tk), and let Hkr be — (nq)~ 1 £(Tk r )- For the 
normal Rasch model, Hi = 0.59639, while H\ r is 0.59641 for r = 9 and H\ r is 0.59639 for r = 20. 
Indeed, the difference H\ r — H\ is about 4 x 10 “ 8 for r = 20. 

In the case of the 2PL model, results are somewhat similar, but not quite as accurate for 
the approximations by Gaussian quadrature. For r = 32, differences in maximum-likelihood 
estimates of the cij and 7 j for models M 2 r and M 2 do not exceed 0.00003 in magnitude. For 
r = 20, differences are within 0.0006 in magnitude, and for r = 9, differences are within 0.04 in 
magnitude. In this case, H 2 = 0.59157, H- 2 r = 0.59165 for r = 9, and Hir = 0.59157 for r = 20. 
The difference H 2 r — H 2 is about 4 x 10“' for r = 20 and 4 x 10 “ 8 for r = 32. 

These differences should be kept in perspective. Use of a standard likelihood-ratio chi-square 
test for Mi versus M 2 yields a statistic of about 3,770 on 44 degrees of freedom, so that model 
Mi can hardly hold in any case, and the difference H\ — H 2 of 0.00474 is much larger than any 
differences Hkr — Hk that are encountered here. 

Even in the case of the normal 2PL model, an interaction model (Haberman, 2004) yields a 
value of H of 0.59112. Here the model used assumes that p is in S and 

<? 

logp(x) = T S - + S'Jj) 

3 =1 

for x in T and Y^j= 1 x j = 8 f° r some real t s , 0 < s < q, /3j, 1 < j < q, and 7j, 1 < j < q. Thus 
effects of approximation of integrals by Gaussian quadrature are relatively small compared to 
basic differences between models. 

The results for Gauss-Hermite quadrature are certainly adequate for r = 32 for both examples, 
and they are rather satisfactory for r = 20. The question arises whether one can manage to 
achieve higher accuracy with a smaller number of points by modification of the procedures used 
for quadrature. This change can be especially important in cases such as the 2PL model in which 
simple sufficient statistics are not available to simplify computations. 

One alternative approach to quadrature is to use the approach found in the version of Parscale 
used in the National Assessment of Educational Progress (NAEP). In this method, integration 
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points are = [k — 21)/5 for 1 < k < 41, so that zk ranges from —4 to 4, and weights are 
Uk = dexp(—z|/2) for 1 < k < 41, where 

41 

d- 1 = ^exp(-^/2). 
fe=i 

Let Hat be the distribution function such that, for x real, Djy(x) is the sum of the Uk for all k 
such that Zk < x. Then maximum-likelihood estimates for model M^ are computed in effect 
for the model M^n that A is in A& and D = D at. This approach is a bit less accurate than is 
Gauss-Hermite integration with r = 20. For example, for Miat, the corresponding estimated 
log penalty per item is H\n = 0.59639, but the differences between estimates for a and 7 j 
from model Miat and from model M\ have absolute values as large as 0.0002. Thus the NAEP 
approach does not appear attractive given that better accuracy can be obtained by Gauss-Hermite 
integration with fewer quadrature points. A more promising alternative is adaptive Gauss-Hermite 
quadrature. 


2 Adaptive Gauss-Hermite Quadrature 

Adaptive Gauss-Hermite quadrature has been used in statistical analysis for some time 
(Naylor & Smith, 1982). The key feature involves careful use of distinct linear transformations 
designed for each integral J(V exp(X)A)). The linear transformations vary for each iteration of 
an iterative algorithm. For iteration t > 0, let the maximum-likelihood estimate A of A have an 
approximation A^ with coordinates A jt for 1 < j < q, let 

V t = JJ[1 + exp(Aji)] 

3 = 1 

be the approximation of V that corresponds to A t, let 

p = exp(A^) 
jt 1 + exp(Ajt) 

be the approximation of Pj that corresponds to A jt, and let Qjt = 1 — Pjt . In the case of Mi, 

X t {9) = a t 91 - 7 t 

for an approximation at to the maximum-likelihood estimate of a and an approximation 7 to the 
maximum-likelihood estimate of 7 . Similarly, in the 2PL case, 

A t(0) = da. t - 7 1 
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for an approximation at to the maximum-likelihood estimate of a with coordinates a.jt for 
1 < j < q and an approximation 7 to the maximum-likelihood estimate of 7 . 

Let 

L u = log V t + X • X t + log <j> 


for each examinee i, so that Lu is the logarithm of the probability density of Oi that corresponds 
to the item logit function Af. Observe that for both models M\ and M 2 , Lu is a strictly concave 
function such that Lu(9) approaches —00 as \Q\ approaches 00 . Thus Lu has a unique maximum 
[lu. Under the 1PL model, the derivative L' it of L lt satisfies 

q 

L}j (Hit ) = cit ^ ' Pjt (jin ) Hit = 0, 

3 =1 


and the second derivative of L lt at Hit is 

q 

Litif^it) = ~ 1 — a t ^ ] Pjt{.Hit)Q jti^Hit) < 0- 

3=1 


Under the 2PL model, 

q 

Liti^it) = ^ ^ a jt[Xjj — Pjt(Hit)] — Hit = 

3 =1 


and the second derivative of Li at U is 


q 

LitiHit) = ~ 1 — ^ J a itL > jt{Hit)Qjt{Hit) < 0 . 

3 = 1 

Adaptive quadrature uses the location Hit of the maximum of Lu and on au = l/[— L'^hu)] 1 ^ 2 ■ 
Consider use of a scale change based on the linear function zu such that 


ZitifL) — (9 Hit)/' &it 


for real 6 . Then zu has inverse 


-1 


(*) 


Hit + cr it z 


for real z. Let 

Ajt — Lu(z it ) Ljj (Hu ) log (j), 

so that A u is a rescaled version of Lu that has first and second derivative 0 at 0. Consider p in S 
such that (1) and (5) hold. Let 


g lt = [A - A(]'Xj - log V + log V t , 
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so that 


p(Xj) = J (exp (git + L it )/(j)) 

and 

n 

%) = £ log J(exp(g it + La)/((>). 

i=1 

Let 

hit = g(z it )• 

Then use of standard formulas from calculus for change of variables shows that 

n n n 

£(p) = ElogK) + E^W + Elog^(exp(A lt + M). (9) 

2 =1 i= 1 2=1 

Iteration t proceeds as if £(p) in (9) is replaced by 

n n n 

4(p) = E lo g( a *i) + E + E log ^r(exp(A it + /lit)). 

2=1 2=1 2=1 

The potential advantage of the adaptive approximation is that A# is normally much less variable 
than is Vt exp(X)At), as is evident from consideration of derivatives at 0. In addition, iteration t is 
based on behavior of A for A close to A t . The potential complication is that nu must be found by 
an iterative computation essentially the same as that used to find the posterior mean of an ability 
distribution. 

For both models M\ and M -2 for the Praxis example, it is quite adequate to let r = 9. In both 
cases, maximum-likelihood estimates of parameters are accurate to within 0.00001. The adaptive 
procedure was substantially faster than the ordinary Gauss-Hermite quadrature with 32 points. 
In the 2PL case, on the particular personal computer on which the calculations were made, the 
adaptive quadrature calculations required about 30 seconds, and the ordinary routine required 
about 90 seconds. Even use of r = 5 only results in a modest decrease in accuracy, for parameter 
estimates in the 2PL case remain accurate to about 0.0001. Thus adaptive Gauss-Hermite 
quadrature is quite attractive. It is reasonable to expect that adaptive Gauss-Hermite quadrature 
is increasingly attractive as the number of items increases, so that the an, 1 < i < u, become 
increasingly small. 


3 Conclusions 

Results in this report support the conclusion that adaptive Gauss-Hermite quadrature is quite 
attractive in marginal estimation when a normal ability distribution is assumed. Generalization 
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to multivariate normal ability distributions should be quite feasible. Such generalizations should 
be important in NAEP models that employ such ability distributions. 
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